Traceroute sampling makes random graphs appear to have power law degree distributions

The topology of the Internet has typically been measured by sampling traceroutes, which are
roughly shortest paths from sources to destinations. The resulting measurements have been used
to infer that the Internet's degree distribution is scale-free; however, many of these measurements
have relied on sampling traceroutes from a small number of sources. It was recently argued that
sampling in this way can introduce a fundamental bias in the degree distribution, for instance,
causing random (Erdős-Rényi) graphs to appear to have power law degree distributions. We explain
this phenomenon analytically using differential equations to model the growth of a breadth-first tree
in a random graph G(n, p = c/n) of average degree c, and show that sampling from a single source
gives an apparent power law degree distribution P(k) ~ 1/k for k < c.



I. INTRODUCTION 

The Internet and the networks it facilitates — includ- 
ing the Web and email networks — are the largest arti- 
ficial complex networks in existence, and understanding 
their structural and dynamic properties is important if 
we wish to understand social and technological networks 
in general. Moreover, efforts to design novel dynamic 
protocols for communication and fault tolerance are well 
served by knowing these properties. 

One structural property of particular interest is the
degree distribution at the router level of the Internet.
This distribution has been inferred both by sampling
traceroutes, i.e., the paths chosen by Internet routers,
which approximate shortest paths in the network, and by
taking "snapshots" of BGP (Border Gateway Protocol)
routing tables [1]. These methods have been criticized
as being noisy and imperfect.
However, Lakhina et al. recently argued that such
methods have a more fundamental flaw. They showed that,
because these methods use only a small number of sources,
only a small fraction of the edges of a graph are
"visible" in such a sample. Moreover, the set of visible
edges is biased in such a way that Erdős-Rényi random
graphs [12], whose underlying
distribution is Poisson, will appear to have a power law 
degree distribution. While no one would argue that the 
Internet is a purely random graph, this certainly calls 
into question the standard measurements of power law 
or "scale-free" degree distributions on the Internet, and 
reopens the problem of characterizing Internet topology. 

In this paper we explain this bias phenomenon analyt- 
ically. Specifically, by modeling the growth of a breadth- 
first spanning tree with differential equations, we show 
that sampling shortest paths from a single source in an 
Erdős-Rényi random graph gives rise to a power law degree
distribution of the form P(k) ~ 1/k, up to a cutoff
k ~ c, where c is the average degree of the underlying
graph. While sampling traceroutes from a single source
is rather limited, Barford [13] provides empirical evidence
that, on the Internet, merging shortest paths from sev- 
eral sources leads to only marginally improved surveys of 
Internet topology. In the Conclusions we discuss gener- 
alizing our approach to sampling from multiple sources. 

Finally, even if the Internet has a power-law degree 
distribution, the exponent may be rather different from 
the one observed in traceroute samples. In future work 
we plan to extend our approach to graphs with arbitrary 
degree distributions, to study the relationship between 
the observed exponent and the underlying one. 



II. INTERNET SPANNING TREES 

Most mapping projects have inferred the Internet's de- 
gree distribution by implicitly building a map of the net- 
work from the union of a large number of traceroutes 
from a single source, or from a small number of sources. 
Assuming that traceroutes are shortest paths, sampling 
from a single source is equivalent to building a spanning 
tree. We therefore model this sampling method by mod- 
eling the growth of a spanning tree on a graph. 

There are several ways one might build a spanning tree. 
We will consider a family of methods, in which at each 
step, every vertex in the graph is labeled reached, pend- 
ing, or unknown. The pending vertices are the leaves 
of the current tree, the reached vertices are those in its 
interior, and the unknown vertices are those not yet con- 
nected. To model traceroutes from a single source, we 
initialize the process by labeling the source vertex pend- 
ing, and all other vertices unknown. 

Then the growth of the spanning tree is given by the 
following pseudocode. 

    while there are pending vertices:
        choose a pending vertex v
        label v reached
        for every unknown neighbor u of v:
            label u pending

The type of spanning tree is determined by how we
choose which pending vertex v we will use to extend the
tree. To model shortest paths, we store the pending
vertices in a queue, take them in FIFO (first-in, first-out)
order, and build a breadth-first tree; if we like, we can
break ties randomly between vertices of the same age in
the queue, which is equivalent to adding a small noise
term to the length of each edge. If we store
the pending vertices on a stack and take them in LIFO
(last-in, first-out) order, we build a depth-first tree.
Finally, we can choose from among the pending vertices
uniformly at random, giving a "random-first" tree.
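
The three selection rules can be sketched in code as follows. This is a minimal illustration, not the implementation used for the experiments reported here: the function names `random_graph` and `spanning_tree_degrees`, the `order` argument, and the small graph size in the demo are all assumptions made for the example.

```python
import random
from collections import Counter, deque


def random_graph(n, c, rng):
    """Adjacency lists of G(n, p = c/n), built by testing every vertex pair."""
    p = c / n
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def spanning_tree_degrees(adj, source, order, rng):
    """Grow a tree from `source` and return {vertex: degree in the tree}.

    order is "bfs" (FIFO queue), "dfs" (LIFO stack) or "random" (uniform choice).
    """
    UNKNOWN, PENDING, REACHED = 0, 1, 2
    label = [UNKNOWN] * len(adj)
    label[source] = PENDING
    pending = deque([source])
    degree = Counter({source: 0})
    while pending:
        if order == "bfs":
            v = pending.popleft()          # oldest pending vertex
        elif order == "dfs":
            v = pending.pop()              # youngest pending vertex
        else:
            # move a uniformly random pending vertex to the end, then pop it
            i = rng.randrange(len(pending))
            pending[i], pending[-1] = pending[-1], pending[i]
            v = pending.pop()
        label[v] = REACHED
        for u in adj[v]:
            if label[u] == UNKNOWN:
                label[u] = PENDING
                pending.append(u)
                degree[v] += 1             # tree edge v-u, counted at both ends
                degree[u] += 1
    return degree


if __name__ == "__main__":
    rng = random.Random(0)
    adj = random_graph(1000, 20, rng)
    for order in ("bfs", "dfs", "random"):
        hist = Counter(spanning_tree_degrees(adj, 0, order, rng).values())
        print(order, [hist[k] for k in range(1, 8)])
```

The sketch breaks breadth-first ties by adjacency-list order rather than randomly, and building the graph pair by pair costs O(n^2) time, so it is only suitable for small n.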

Surprisingly, while these three processes build different
trees, and traverse them in different orders, we will
see in the next section that they all yield the same
degree distribution when n is large. To illustrate this,
Fig. 1 shows empirical degree distributions of breadth-first,
depth-first, and random-first spanning trees for a
random graph G(n, p = c/n) where n = 10^ and c = 100.
The three degree distributions are indistinguishable;
further, they are well-matched by the analytic results given
in the next section, and obey a power law with exponent
-1 for degrees less than c. For comparison, we also show
the Poisson degree distribution of the underlying graph.

FIG. 1: A log-log plot of sampled degree distributions from
breadth-first, depth-first and random-first spanning trees on a
random graph of size n = 10^ and average degree c = 100, and
our analytic results (the black dots). The agreement between
the three types of trees and our analytic results is extremely
good. For comparison, the black line shows the Poisson degree
distribution of the underlying graph. Note the power
law behavior of the apparent degree distribution P(k) ~ 1/k, which
extends up to a cutoff at k ~ c.



III. ANALYTIC RESULTS 

In this section we show analytically that building spanning
trees in Erdős-Rényi random graphs G(n, p = c/n),
using any of the processes described above, gives rise to
an apparent power law degree distribution P(k) ~ 1/k
for k < c. We focus here on the case where the average
degree c is large, but constant with respect to n; we believe
our results also hold if c is a moderately growing
function of n, such as log n or n^ε for small ε, but it seems
more difficult to make our analysis rigorous in that case.

We will model the progress of the while loop described
above. Let S(T) and U(T) denote the number of pending
and unknown vertices at step T, respectively. The expected
changes in these variables at each step are (where
E[·] denotes the expectation)

\[
\begin{aligned}
E[U(T+1) - U(T)] &= -p\,U(T) \\
E[S(T+1) - S(T)] &= p\,U(T) - 1
\end{aligned}
\tag{1}
\]



Here the pU(T) terms come from the fact that a given
unknown vertex u is connected to the chosen pending
vertex v with probability p, in which case we change its
label from unknown to pending; the -1 term comes from
the fact that we also change v's label from pending to
reached. Moreover, these equations apply no matter how
we choose v, whether v is the "oldest" vertex (breadth-first),
the "youngest" one (depth-first), or a random one
(random-first). By the principle of deferred decisions,
the events that v is connected to each unknown vertex
u are independent and occur with probability p. Our
experiments do indeed show that these three processes
result in the same degree distribution.

Writing t = T/n, s(t) = S(tn)/n and u(t) = U(tn)/n,
the difference equations (1) become the following system
of differential equations:

\[
\frac{du}{dt} = -c\,u , \qquad \frac{ds}{dt} = c\,u - 1 . \tag{2}
\]

With the initial conditions u(0) = 1 and s(0) = 0, the
solution to (2) is

\[
u(t) = e^{-ct} , \qquad s(t) = 1 - t - e^{-ct} . \tag{3}
\]
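
As an informal check, the sketch below simulates only the counts S and U, drawing the number of newly pending vertices at each step as a sum of independent coin flips with probability p, and compares the rescaled counts with u(t) = e^{-ct} and s(t) = 1 - t - e^{-ct}, which solve du/dt = -cu, ds/dt = cu - 1 with u(0) = 1 and s(0) = 0. The function names and the parameter values in the demo are assumptions for illustration; this is not the rigorous argument.

```python
import math
import random


def simulate_counts(n, c, rng):
    """Run the (S, U) chain for G(n, p = c/n); return a list of (T, S, U)."""
    p = c / n
    S, U, T = 1, n - 1, 0                  # the source is pending, the rest unknown
    history = [(T, S, U)]
    while S > 0:
        # deferred decisions: each unknown vertex is adjacent to the chosen
        # pending vertex independently with probability p
        new = sum(1 for _ in range(U) if rng.random() < p)
        U -= new
        S += new - 1                       # the chosen vertex becomes reached
        T += 1
        history.append((T, S, U))
    return history


def max_deviation(history, n, c):
    """Largest gap between the rescaled counts and the closed-form u(t), s(t)."""
    worst = 0.0
    for T, S, U in history:
        t = T / n
        u = math.exp(-c * t)
        s = 1.0 - t - u
        worst = max(worst, abs(U / n - u), abs(S / n - s))
    return worst


if __name__ == "__main__":
    n, c = 2000, 5.0
    hist = simulate_counts(n, c, random.Random(0))
    print("steps taken:", hist[-1][0])
    print("max deviation:", max_deviation(hist, n, c))
```

The chain never looks at which pending vertex was chosen, which is the sense in which the breadth-first, depth-first and random-first processes share the same equations. If the source happens to lie outside the giant component, the run simply stops after a few steps.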



The algorithm ends at the smallest positive root t_f of
s(t) = 0; using Lambert's function W, defined as W(x) =
y where y e^y = x, we can write

\[
t_f = 1 + \frac{1}{c}\, W\!\left(-c\,e^{-c}\right) . \tag{4}
\]

Note that t_f is the fraction of vertices which are reached
at the end of the process, and this is simply the size of
the giant component of G(n, c/n).
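
The stopping time can also be found numerically without a Lambert W routine. The sketch below, with the hypothetical helper name `final_time`, locates the positive root of s(t) = 1 - t - e^{-ct} by bisection for c > 1, and the usage lines check it against the Lambert form: with w = c(t_f - 1), the relation w e^w = -c e^{-c} should hold. (This corresponds to the principal branch of W; the other real branch returns w = -c, i.e. the trivial root t = 0.)

```python
import math


def final_time(c, tol=1e-12):
    """Positive root of s(t) = 1 - t - exp(-c t) for c > 1, by bisection."""
    if c <= 1:
        raise ValueError("for c <= 1 the only non-negative root is t = 0")

    def s(t):
        return 1.0 - t - math.exp(-c * t)

    # s is concave with s(0) = 0; its maximum is at t = ln(c)/c, where s > 0,
    # and s(1) = -exp(-c) < 0, so the root lies between these two points.
    lo, hi = math.log(c) / c, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if s(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for c in (1.5, 2.0, 5.0, 100.0):
        tf = final_time(c)
        w = c * (tf - 1.0)
        print(c, tf, w * math.exp(w) + c * math.exp(-c))  # last value should be near 0
```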

Now, we wish to calculate the degree distribution P(k)
of this tree. The degree of each vertex v is the number
of its previously unknown neighbors, plus one for the
edge by which it became attached (except for the root).
Now, if v is chosen at time t, in the limit n → ∞ the
probability it has k unknown neighbors is given by the
Poisson distribution with mean m = c u(t),

\[
\mathrm{Poisson}(m, k) = \frac{e^{-m}\, m^{k}}{k!} .
\]

Averaging over all the vertices in the tree gives

\[
P(k+1) = \frac{1}{t_f} \int_0^{t_f} dt \; \mathrm{Poisson}(c\,u(t), k) .
\]

It is helpful to change the variable of integration to m.
Since m = c e^{-ct} we have dm = -c m dt, and

\[
\begin{aligned}
P(k+1) &= \frac{1}{c\,t_f} \int_{c(1-t_f)}^{c} \frac{dm}{m}\, \mathrm{Poisson}(m, k) \\
       &\approx \frac{1}{c\,k!} \int_{c e^{-c}}^{c} dm \; e^{-m}\, m^{k-1} .
\end{aligned}
\tag{5}
\]

Here in the second line we use the fact that t_f ≈ 1 - e^{-c}
when c is large (i.e., the giant component encompasses
almost all of the graph).

The integral in (5) is given by the difference between
two incomplete Gamma functions. However, since the
integrand is peaked at m = k - 1 and falls off exponentially
for larger m, for k < c it coincides almost exactly
with the full Gamma function Γ(k). Specifically, for any
c > 0 we have

\[
\int_0^{c e^{-c}} dm \; e^{-m}\, m^{k-1} < c\,e^{-c}
\]

and, if k - 1 = c(1 - ε) for ε > 0, then

\[
\int_c^{\infty} dm \; e^{-m}\, m^{k-1}
= e^{-c}\, c^{k-1} \int_0^{\infty} dx \; e^{-x} (1 + x/c)^{k-1}
\le e^{-c}\, c^{k-1} \int_0^{\infty} dx \; e^{-\epsilon x}
= \frac{e^{-c}\, c^{k-1}}{\epsilon} .
\]

This is o(Γ(k)) if ε > 1/√k, i.e., if k < c - c^α for some
α > 1/2. In that case we have

\[
P(k+1) = (1 - o(1))\, \frac{\Gamma(k)}{c\,k!} \sim \frac{1}{c\,k} , \tag{6}
\]

giving a power law of exponent -1 up to k ~ c.

Although we omit some technical details, this derivation
can be made mathematically rigorous using results
of Wormald, who showed that under fairly generic
conditions, the state of discrete stochastic processes like
this one is well-modeled by the corresponding rescaled
differential equations. Specifically, it can be shown that
if we condition on the initial source vertex being in the
giant component, then with high probability, for all t
such that 0 < t < t_f, U(tn) = u(t) n + o(n) and
S(tn) = s(t) n + o(n). It follows that with high probability
our calculations give the correct degree distribution
of the spanning tree within o(1).



IV. DISCUSSION AND CONCLUSIONS

Lakhina et al. argued that sampling traceroutes
from a small number of sources profoundly underesti- 
mates the degrees of vertices far from the source, and 
that this effect can cause graphs to appear to have a 
power law degree distribution even when their underly- 
ing distribution is Poisson. In this work, we modeled this 
sampling process as the construction of a spanning tree, 
and showed analytically that for sparse random graphs 
of large average degree, the apparent degree distribution 
does indeed obey a power law of the form P(k) ~ k^{-1} for
k below the average degree. This illustrates the danger 
of concluding the existence of a power-law from data over 
too small a range of degrees, and, more specifically, the 
danger of sampling traceroutes from just a few sources. 
While our analytic results hold for traceroutes from a 
single source, we conjecture that if we use any constant 
number of sources and take the union of the resultant 
spanning trees, the observed distribution will remain sub- 
ject to the same sampling bias, and one will again observe 
a power-law degree distribution. 
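
The single-source claim can be checked numerically. The sketch below evaluates the average of Poisson(c u(t), k) over 0 < t < t_f from Section III, with u(t) = e^{-ct}, and compares it with 1/(ck). The helper names, the substitution x = ln m used for the quadrature, and the step count are assumptions made for this illustration.

```python
import math


def final_time(c, iters=200):
    """Positive root of 1 - t = exp(-c t), by fixed-point iteration (c > 1)."""
    t = 1.0
    for _ in range(iters):
        t = 1.0 - math.exp(-c * t)
    return t


def tree_degree_prob(degree, c, steps=20000):
    """P(k+1) = (1/t_f) * integral_0^{t_f} Poisson(c e^{-ct}, k) dt, with degree = k + 1."""
    k = degree - 1
    tf = final_time(c)
    # substitute x = ln m = ln c - c t, so that dt = -dx / c
    x_lo, x_hi = math.log(c) - c * tf, math.log(c)
    h = (x_hi - x_lo) / steps

    def pmf(x):
        # Poisson(m = e^x, k), evaluated in log space to avoid overflow
        return math.exp(k * x - math.exp(x) - math.lgamma(k + 1))

    # trapezoidal rule on a uniform grid in x
    total = 0.5 * (pmf(x_lo) + pmf(x_hi))
    total += sum(pmf(x_lo + i * h) for i in range(1, steps))
    return total * h / (c * tf)


if __name__ == "__main__":
    c = 100.0
    for k in (2, 5, 10, 30, 60, 90, 100, 120, 150):
        print(k, tree_degree_prob(k + 1, c), 1.0 / (c * k))
```

For k well below c the two printed columns should agree closely, while for k near or above c the computed value falls below 1/(ck), which is the cutoff described above.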

Certainly this exponent -1 is not the one observed
for the Internet. For instance, Faloutsos, Faloutsos and
Faloutsos observed degree distributions P(k) ~ k^{-2.5}
at the router level, and k^{-2.2} for out-degrees at the
inter-domain (BGP) level. While the real degree distribution
of the Internet may be a power law, one possibility is that
the true exponent is rather different from the observed
one; this possibility was recently explored by Peterman
and de los Rios [15] (they also study another mechanism
for observing apparent power laws in the degree distribu- 
tion, which samples the original graph via a probabilistic 
pruning strategy and gives an observed exponent of —2 
for random graphs). In future work we will extend our 
differential equation model to random graphs with power 
law degree distributions to explore this possibility. 

Although our analytic model of spanning trees illus- 
trates one possible mechanism for apparent power-law 
degree distributions, another possibility is that decisions 
(such as those made by the border-gateway protocol) 
made by actual Internet routers interact in complex ways 
with the underlying link-level topology. It may well
be that routers only use a small fraction of the edges in 
the network, and that the edges they actually use give 
rise to an effective degree distribution with roughly a 
power-law form. This possibility implies that even if the 
real degree distribution of the Internet is something very 
different from a power law, it may not matter as these 
extra links are not utilized in normal routing decisions. 

Another difference of unknown, but perhaps small, 
significance is that on the Internet, routing algorithms 
are intended to optimize for end-to-end traffic flow, and 
routers in the "interior" of the Internet such as those on 
the backbone are not themselves destinations or sources. 
Unlike the empirical studies cited earlier, we selected 
sources uniformly at random for our spanning trees. 

The story of the topology of the Internet is far from 
over and will undoubtedly remain a topic of great interest
for many years to come. However, knowing decisively 
that current methods are fundamentally biased will serve 
to push the state-of-the-art forward toward more robust 
methods of characterization. We see the following three 
lines of inquiry as enticing: 

First, having analytically shown that Poisson degree
distributions can lead to apparent power laws, we naturally
wish to generalize our approach to random graphs
with arbitrary degree distributions, characterize the fam- 
ily of distributions which generate apparent power laws, 
and understand the relationship between observed and 
underlying exponents for power law distributions. Hav- 
ing this generalization may allow us to make firm and 
useful claims about the topology of the Internet. 

Secondly, in what manner is the traceroute-sampled
degree distribution of a random graph dependent on the
length of the paths taken between a source and destination?
If we simulate a "crawl" by using random walks
rather than short paths, how would the observed degree
distribution change?

Finally, if we build spanning trees from m sources
and take their unions, how does the observed degree dis- 
tribution vary with m? If we are correct that any con- 
stant m leads to the same kind of bias, how does m need 
to grow with the network to obtain an accurate sample? 
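
One way such a multi-source experiment could be set up is sketched below. It only builds the union of breadth-first trees and reports observed degrees; it makes no claim about the outcome. The function names, the graph size, and the tie-breaking by adjacency-list order (rather than at random) are assumptions for illustration.

```python
import random
from collections import Counter, deque


def random_graph(n, c, rng):
    """Adjacency lists of G(n, p = c/n), built by testing every vertex pair."""
    p = c / n
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def bfs_tree_edges(adj, source):
    """Edges, as sorted vertex pairs, of a breadth-first tree rooted at `source`."""
    seen = {source}
    queue = deque([source])
    edges = set()
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
                edges.add((min(u, v), max(u, v)))
    return edges


def observed_degrees(adj, sources):
    """Degree of every vertex in the union of the BFS trees from `sources`."""
    union = set()
    for s in sources:
        union |= bfs_tree_edges(adj, s)
    degree = [0] * len(adj)
    for a, b in union:
        degree[a] += 1
        degree[b] += 1
    return degree


if __name__ == "__main__":
    rng = random.Random(0)
    n, c = 1000, 20
    adj = random_graph(n, c, rng)
    for m in (1, 2, 4, 8):
        sources = rng.sample(range(n), m)
        hist = Counter(observed_degrees(adj, sources))
        print(m, [hist[k] for k in range(1, 8)])
```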